In this project, I will explore a data set on wine quality and the corresponding chemical contents.
First, let’s run some basic functions to examine the structure and schema of the data.
Number of observations and variables:
## [1] 6497 14
Field names:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "color"
Structure of the data and summary statistics:
## 'data.frame': 6497 obs. of 14 variables:
## $ X : int 4898 4895 4894 4893 4892 4891 4889 4883 4882 4880 ...
## $ fixed.acidity : num 6 6.6 6.2 6.5 5.7 6.1 6.8 5.5 5 6.6 ...
## $ volatile.acidity : num 0.21 0.32 0.21 0.23 0.21 0.34 0.22 0.32 0.235 0.34 ...
## $ citric.acid : num 0.38 0.36 0.29 0.38 0.32 0.29 0.36 0.13 0.27 0.4 ...
## $ residual.sugar : num 0.8 8 1.6 1.3 0.9 ...
## $ chlorides : num 0.02 0.047 0.039 0.032 0.038 0.036 0.052 0.037 0.03 0.046 ...
## $ free.sulfur.dioxide : num 22 57 24 29 38 25 38 45 34 68 ...
## $ total.sulfur.dioxide: num 98 168 92 112 121 100 127 156 118 170 ...
## $ density : num 0.989 0.995 0.991 0.993 0.991 ...
## $ pH : num 3.26 3.15 3.27 3.29 3.24 3.06 3.04 3.26 3.07 3.15 ...
## $ sulphates : num 0.32 0.46 0.5 0.54 0.46 0.44 0.54 0.38 0.5 0.5 ...
## $ alcohol : num 11.8 9.6 11.2 9.7 10.6 ...
## $ quality : int 6 5 6 5 6 6 5 5 6 6 ...
## $ color : chr "White" "White" "White" "White" ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.: 813 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median :1650 Median : 7.000 Median :0.2900 Median :0.3100
## Mean :2044 Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.:3274 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :4898 Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.443 Mean :0.05603 Mean : 30.53
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.219 Mean :0.5313
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
## alcohol quality color
## Min. : 8.00 Min. :3.000 Length:6497
## 1st Qu.: 9.50 1st Qu.:5.000 Class :character
## Median :10.30 Median :6.000 Mode :character
## Mean :10.49 Mean :5.818
## 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :14.90 Max. :9.000
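The output above has the shape of what base R’s `dim()`, `names()`, `str()` and `summary()` print. A minimal sketch of those calls follows. The small `wine` data frame is only a stand-in built from the first few values shown above, since the name of the real data file is not given in this report.

```r
# Stand-in for the wine data, using the first five alcohol and quality values
# from the str() output above. The real file name is not stated in this report.
wine <- data.frame(
  alcohol = c(11.8, 9.6, 11.2, 9.7, 10.6),
  quality = c(6L, 5L, 6L, 5L, 6L),
  color   = c("White", "White", "White", "White", "White"),
  stringsAsFactors = FALSE
)

dim(wine)      # number of observations and variables
names(wine)    # field names
str(wine)      # type and first values of each column
summary(wine)  # quartiles and mean for the numeric columns
```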
Let’s look at the quality distribution now (how the wines were rated).
The quality ratings have a roughly normal, slightly skewed distribution. Most wines were rated 5 or 6. The lowest rating is 3 and the highest rating is 9.
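One way to check these statements is to tabulate the ratings. The sketch below uses a short made-up vector of ratings for illustration; in the analysis itself the `quality` column of the data would be used instead.

```r
# Made-up ratings for illustration only; not the real data.
quality <- c(3, 5, 5, 5, 6, 6, 6, 6, 7, 9)

counts <- table(quality)            # number of samples per rating
counts
range(quality)                      # lowest and highest rating
names(counts)[which.max(counts)]    # most common rating
```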
We would like to plot each individual factor and try to find its potential influence on wine quality.
Then, let’s look at the most familiar variable, alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.30 10.49 11.30 14.90
The minimum alcohol content in the sample is 8% and the maximum is 14.9%. The mean is 10.49%, slightly above the median of 10.3%, which suggests that the distribution is skewed to the right.
Let’s look at another common variable, residual sugar:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 3.000 5.443 8.100 65.800
Unlike alcohol, residual sugar spans a wide range. According to Wikipedia’s classification [2], dry wine has less than 4 g/L of sugar, medium dry wine has 4-12 g/L, medium wine has 12-45 g/L, and sweet wine has more than 45 g/L.
Based on this definition, the share of each type can be seen in the following graph and counts:
##
## Dry Medium Medium Dry Sweet
## 3571 833 2092 1
Most samples are dry wine and only a barely visible portion is sweet wine.
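The classification above can be sketched with `cut()`. The thresholds come from the definition quoted earlier, but how values exactly on a boundary (4, 12 or 45 g/L) were handled is not stated, so the choice below is an assumption. The helper name `sweetness_class` is made up for this example.

```r
# Thresholds in g/L from the classification quoted above.
# Assumption: values exactly on a boundary fall into the higher class.
sweetness_class <- function(sugar) {
  cut(sugar,
      breaks = c(-Inf, 4, 12, 45, Inf),
      labels = c("Dry", "Medium Dry", "Medium", "Sweet"),
      right  = FALSE)
}

# Illustrative values, including the data set minimum (0.6) and maximum (65.8).
sugar   <- c(0.6, 3.0, 8.1, 20, 65.8)
classes <- sweetness_class(sugar)
table(classes)   # counts per sweetness class
```

Because the result is a factor, `table()` lists the classes from driest to sweetest rather than alphabetically.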
Let’s look at the acids group:
The fixed.acidity values are roughly normally distributed, while volatile.acidity and citric.acid have skewed distributions.
Then we would like to see the chlorides and sulphates:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
The chlorides variable has a skewed distribution and a few significant outliers. The maximum value is nearly 10 times the third-quartile value.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4300 0.5100 0.5313 0.6000 2.0000
The sulphates variable also has a skewed distribution; however, its outliers are not as extreme as those of chlorides. The maximum value is 2 g/L and the minimum value is 0.22 g/L.
Based on their descriptions, the 11 chemical factors can be grouped as follows:
1. Acids
2. Sugar
3. Alcohol
4. Chlorides
5. Sulphates
We will mainly examine these five groups and their relationship to quality.
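As a quick check of how extreme the largest values are, the maximum can be divided by the third quartile. The numbers below are copied from the two summaries above; the variable names are just for this example.

```r
# Values copied from the summary output above.
chlorides_q3 <- 0.065
chlorides_max <- 0.611
sulphates_q3 <- 0.60
sulphates_max <- 2.00

chlorides_ratio <- chlorides_max / chlorides_q3   # about 9.4
sulphates_ratio <- sulphates_max / sulphates_q3   # about 3.3
c(chlorides = chlorides_ratio, sulphates = sulphates_ratio)
```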
There are 6497 observations of 14 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality, color). Quality is an ordered, categorical, discrete variable. It is on a 0-10 scale and was rated by at least 3 wine experts. The values range only from 3 to 9, with a mean of 5.818 and a median of 6. X is the numbering of the wine samples. Color is a categorical variable that was created for this analysis. All other variables are quantitative measures of the chemical content of the wine.
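If quality needs to be treated as an ordered categorical variable in R, one option is an ordered factor. This is only a sketch and not necessarily how the variable was handled in this analysis; the ratings below are the first ten values from the `str()` output.

```r
# First ten quality ratings from the str() output above.
quality <- c(6, 5, 6, 5, 6, 6, 5, 5, 6, 6)

# Ordered factor over the full 0-10 rating scale.
quality_f <- factor(quality, levels = 0:10, ordered = TRUE)

is.ordered(quality_f)          # TRUE
quality_f[1] > quality_f[2]    # comparisons follow the rating order
table(quality_f)               # unused ratings appear with a count of 0
```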
The main feature of interest is the set of factors affecting the quality of red and white wine. I suspected that alcohol, residual.sugar and pH would affect quality. The other point of interest is the difference between red and white wine.
From the description of the variables, it seems that fixed.acidity and volatile.acidity, free.sulfur.dioxide and total.sulfur.dioxide, and alcohol and density may be correlated pairs.
Yes, ‘color’ is a newly created variable.
Variables such as residual.sugar and free.sulfur.dioxide have significant outliers. However, considering the units used, the outliers are acceptable, and the data set is already tidy.
From the nature of the chemicals, let’s examine the correlation group by group. The first group is about acids, pH and quality:
The acids do not have a strong relationship with quality. Among them, volatile.acidity has the largest correlation in magnitude (R = -0.266). Surprisingly, volatile.acidity is negatively correlated with citric.acid, and pH (a log-scale measure of acidity) is positively related to volatile.acidity. We will examine these relationships further.
Let’s then examine the second group of factors: residual sugar, alcohol, density, sulphates, chlorides and quality.
This result agrees with basic physics: sugar adds to density, while alcohol reduces it. Among these factors, alcohol and chlorides are the most important independent factors influencing quality (density can be seen as a dependent factor).
Let’s see the last group of data.
The last group shows rather weak relationships. The type (color) does not seem to influence the quality, and the three other factors are all weakly related to quality. The strongest relationship is between free.sulfur.dioxide and total.sulfur.dioxide. However, that is expected from their names, since free sulfur dioxide is part of the total.
Now let’s examine the key factors (alcohol, chlorides and volatile.acidity) and their relationship to quality.
The correlation coefficient between alcohol and quality is positive (R = 0.4443185).
The correlation coefficient between chlorides and quality is negative (R = -0.2006655).
The correlation coefficient between volatile.acidity and quality is negative (R = -0.2656995).
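Coefficients like these can be computed with `cor()`, which returns Pearson’s R by default. The vectors below are made up purely to illustrate the call and do not reproduce the values reported above.

```r
# Made-up numbers for illustration; they are not taken from the wine data.
alcohol <- c(9.0, 9.5, 10.0, 10.5, 11.0, 12.0, 12.5, 13.0)
quality <- c(5,   5,   5,    6,    6,    6,    7,    7)

r <- cor(alcohol, quality)    # Pearson correlation coefficient
r

# cor.test() additionally reports a p-value and a confidence interval.
p <- cor.test(alcohol, quality)$p.value
p
```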
We will also examine the difference between red and white wine. Here I will include one more factor I am interested in: residual sugar.
The difference of alcohol between white and red wine is not significant.
After removing the outliers, we can see that the average chlorides content of red wine is higher than that of white wine.
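The comparison can be sketched as follows. The rule used to drop outliers is not stated in this report, so the 95th-percentile cutoff below is an assumption, and the data are made up for illustration.

```r
# Made-up data for illustration only.
wine <- data.frame(
  color     = c("Red", "Red", "Red", "Red", "White", "White", "White", "White"),
  chlorides = c(0.08, 0.09, 0.07, 0.60, 0.04, 0.05, 0.045, 0.035)
)

# Assumption: values above the 95th percentile are treated as outliers.
cutoff  <- quantile(wine$chlorides, 0.95)
trimmed <- wine[wine$chlorides <= cutoff, ]

# Mean chlorides content per color after trimming.
means <- tapply(trimmed$chlorides, trimmed$color, mean)
means
```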
From this figure, the acid content of red wine is approximately twice that of white wine. We can say that red wine is more sour.
From all of the above, alcohol has a positive relationship with quality, while chlorides and volatile.acidity decrease it.
Comparing red and white wine, white wine has less volatile.acidity and chlorides, more sugar, and a slightly higher alcohol content.
Density has a negative relationship with alcohol. It also has positive correlation with residual sugar. The correlation coefficients are -0.687 and 0.553 respectively.
White wine tends to have more alcohol, more residual sugar, and less acids and chlorides.
As assumed in section 1, there are some intrinsic relationships between the variables. For example, free.sulfur.dioxide and total.sulfur.dioxide are highly correlated, and pH has a negative relationship with the acids.
The strongest relationship is between density and alcohol (R = -0.687), which makes sense because alcohol is less dense than water (about 49.3 lb/ft^3 versus 62.4 lb/ft^3).
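To compare these two figures with the density column, they can be converted to g/cm^3, assuming that is the unit of the density column (its values lie between 0.9871 and 1.0390). The helper name `lb_ft3_to_g_cm3` is made up for this example.

```r
# Convert lb/ft^3 to g/cm^3 (1 lb = 453.59237 g, 1 ft^3 = 28316.846592 cm^3).
lb_ft3_to_g_cm3 <- function(x) x * 453.59237 / 28316.846592

alcohol_density <- lb_ft3_to_g_cm3(49.3)   # about 0.79 g/cm^3
water_density   <- lb_ft3_to_g_cm3(62.4)   # about 1.00 g/cm^3
round(c(alcohol = alcohol_density, water = water_density), 3)
```

A sample with more alcohol and less water therefore sits lower in the density range, which is consistent with the negative correlation.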
In both red and white wine, alcohol positively influences the quality.
[Figure: "Acids vs. Quality for Red and White Wine"]
In both red and white wine, volatile.acidity negatively influences the quality. However, red wine is more sensitive to it, while the relationship between volatile.acidity and quality is relatively weak for white wine.
In both red and white wine, the chlorides negatively influence the quality, although red wine has higher chlorides content in every level of rating.
In the last section, we have examined the relationship between residual sugar and quality. White wine has a slightly negative relationship while red wine has a positive relationship.
It can be inferred that a high-quality red wine is expected to be more “sweet”, while a high-quality white wine is expected to be less “sweet”.
In this section, I found that the relationships of quality to alcohol, chlorides and volatile.acidity differ between red and white wine.
The standards used to judge the quality of red wine and white wine are different. For red wine, the residual sugar has a positive relationship with the quality. However, for white wine, it is negatively related to the quality.
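A per-color comparison like this can be sketched by splitting the data by color and computing one correlation coefficient per group. The data below are made up so that the two groups show opposite signs; they are not the real measurements.

```r
# Made-up data for illustration only.
wine <- data.frame(
  color          = rep(c("Red", "White"), each = 5),
  residual.sugar = c(1.5, 2.0, 2.5, 3.0, 3.5,   2, 5, 8, 11, 14),
  quality        = c(5,   5,   6,   6,   7,     7, 6, 6, 5,  5)
)

# One correlation coefficient between residual sugar and quality per color.
r_by_color <- sapply(split(wine, wine$color),
                     function(d) cor(d$residual.sugar, d$quality))
r_by_color
```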
The volatile.acidity content has a negative effect on red wine, but white wine is not sensitive to it.
Both wines show some trend under the influence of alcohol and chlorides.
This figure shows the distribution of wine ratings. Among the 1599 red wine samples and 4898 white wine samples, most were rated 5 or 6 (2000+ and 2800+ samples respectively). The quality ratings have a roughly normal, slightly skewed distribution. The lowest rating is 3 and the highest rating is 9. Among the samples rated under 6, red wine makes up about one third. However, among high-rated samples, red wine makes up a much smaller portion.
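For reference, the overall share of red wine follows directly from the sample counts, and the share within each rating can be read from a cross-tabulation. The counts 1599 and 4898 come from the text above; the short `color` and `quality` vectors are made up to show the `prop.table()` call.

```r
# Sample counts from the text above.
n_red   <- 1599
n_white <- 4898
n_total <- n_red + n_white            # 6497, the number of observations
red_share <- n_red / n_total          # overall share of red samples
round(red_share, 3)

# Made-up ratings to show the cross-tabulation; not the real counts.
color   <- c("Red", "Red", "White", "White", "White", "White")
quality <- c(5, 5, 5, 6, 7, 7)
tab     <- table(color, quality)
shares  <- prop.table(tab, margin = 2)  # share of each color within each rating
shares
```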
This picture depicts the difference between red and white wine. Red wine has more acids, more chlorides, less sugar and slightly less alcohol. The greatest difference in the figure is in volatile acidity: red wine has an average of 0.5 g/L while white wine has only 0.2 g/L. All groups of data have a few significant outliers.
The volatile.acidity content has a high impact on quality. This figure depicts how red and white wine behave differently in terms of volatile.acidity. For red wine, the average volatile.acidity content is higher than for white wine, and the quality is more sensitive to changes in volatile.acidity. Overall, volatile.acidity has a negative relationship with wine quality (R = -0.266).
“The biggest difference between reds and whites is in how they’re made. The grapes used for red and white wines generally look very different—as you might imagine, red wine grapes are darker and have more pigment. When making white wine, typically the grapes are pressed and then just the juice is fermented.”1
The nature of the grapes and the winemaking process make the telling difference. Through the data, we looked into the differences between red and white wine in terms of their chemical contents. Compared to red wine, white wine tends to have higher alcohol, more residual sugar, and less acids and chlorides (probably because of the winemaking process).
Some factors affecting quality also differed between red and white wine. Residual sugar and acids contributed positively to the quality of red wine, but they lowered the ratings of white wine. Sulphates positively influenced red wine quality, while white wine seems to be insensitive to them. For both wines, higher alcohol content tended to be rated as better quality.
After all, quality rating is a relatively subjective measure. Human beings, even experts, have their limits in distinguishing the tiny differences between samples, not to mention consumers. That is probably why most wines were rated 5 or 6. If more extreme cases (below 3 or greater than 8) can be gathered, I would be interested to see why those samples stand out.
References:
1. http://www.winespectator.com/drvinny/show/id/44697
2. https://en.wikipedia.org/wiki/Sweetness_of_wine